Multi-word Term Extraction Based on New Hybrid Approach for Arabic Language
Abstract
Arabic multiword terms (AMWTs) are relevant strings of words in text documents. Once extracted automatically, they can improve the performance of text mining applications such as categorisation, clustering, information retrieval, machine translation, and summarization. This paper introduces a multiword term extraction system based on contextual information. We propose a new hybrid method for Arabic multiword term extraction. Like other hybrid methods, it consists of two main steps: a linguistic approach and a statistical one. In the first step, the linguistic approach uses a Part-Of-Speech (POS) tagger (Taani's Tagger) and the Sequence Identifier as patterns to extract candidate AMWTs. The second step, which contains our main contribution, is a statistical approach that incorporates contextual information through a newly proposed association measure based on termhood and unithood. To evaluate the method, we tested and compared it using three association measures: the proposed one, named NTC-Value, as well as NC-Value and C-Value. Experimental results on Arabic texts from the environment domain show that our hybrid method outperforms the others in terms of precision and that it handles tri-gram Arabic multiword terms correctly.
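The two-step pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the input is assumed to be already POS-tagged, the pattern set and the names `PATTERNS`, `extract_candidates`, `is_nested`, and `c_value` are hypothetical, and the scoring uses the standard C-Value (one of the three measures compared), since the abstract does not define NTC-Value.

```python
import math
from collections import Counter

# Hypothetical POS patterns for candidate terms. The abstract does not list
# the patterns actually used, so these are illustrative only.
PATTERNS = {("NOUN", "NOUN"), ("NOUN", "ADJ"), ("NOUN", "NOUN", "ADJ")}


def extract_candidates(tagged_sentences):
    """Linguistic step: count every bi-/tri-gram whose POS sequence matches a pattern."""
    counts = Counter()
    for sentence in tagged_sentences:
        for n in (2, 3):
            for i in range(len(sentence) - n + 1):
                window = sentence[i:i + n]
                if tuple(tag for _, tag in window) in PATTERNS:
                    counts[tuple(word for word, _ in window)] += 1
    return counts


def is_nested(short, long):
    """True if `short` occurs as a contiguous sub-sequence of the longer `long`."""
    n = len(short)
    return len(long) > n and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )


def c_value(counts):
    """Statistical step: score each candidate with the standard C-Value."""
    scores = {}
    for term, freq in counts.items():
        # Longer candidates that contain this term.
        containers = [other for other in counts if is_nested(term, other)]
        if containers:
            # Discount the average frequency of the containing candidates.
            nested_freq = sum(counts[c] for c in containers) / len(containers)
            scores[term] = math.log2(len(term)) * (freq - nested_freq)
        else:
            scores[term] = math.log2(len(term)) * freq
    return scores


# Placeholder tokens w1..w3 stand in for tagged Arabic words.
tagged = [
    [("w1", "NOUN"), ("w2", "NOUN"), ("w3", "ADJ")],
    [("w1", "NOUN"), ("w2", "NOUN")],
]
ranked = sorted(c_value(extract_candidates(tagged)).items(),
                key=lambda item: item[1], reverse=True)
```

Ranking by score puts the tri-gram first in this toy input, because a nested bi-gram is discounted by the frequency of the longer candidates that contain it.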
Similar Resources
A Multi-Word Term Extraction Program for Arabic Language
Terminology extraction commonly includes two steps: identification of term-like units in the texts, mostly multi-word phrases, and the ranking of the extracted term-like units according to their domain representativity. In this paper, we design a multi-word term extraction program for Arabic language. The linguistic filtering performs a morphosyntactic analysis and takes into account several ty...
A Study of Association Measures and their Combination for Arabic MWT Extraction
Automatic Multi-Word Term (MWT) extraction is a very important issue to many applications, such as information retrieval, question answering, and text categorization. Although many methods have been used for MWT extraction in English and other European languages, few studies have been applied to Arabic. In this paper, we propose a novel, hybrid method which combines linguistic and statistical a...
Identifying Contextual Information for Multi-Word Term Extraction
Methods for multi-word term extraction have traditionally involved statistical techniques. More recently, hybrid techniques have been evolving which incorporate some linguistic knowledge. This information is generally very shallow, and researchers have tended to ignore any real understanding of either terms or the context in which they appear. We adopt an approach which uses a variety of knowle...
A Hybrid Machine Translation System Based on a Monotone Decoder
In this paper, a hybrid Machine Translation (MT) system is proposed by combining the result of a rule-based machine translation (RBMT) system with a statistical approach. The RBMT uses a set of linguistic rules for translation, which leads to better translation results in terms of word ordering and syntactic structure. On the other hand, SMT works better in lexical choice. Therefore, in our sys...
Multi-word term extraction from comparable corpora by combining contextual and constituent clues
In this paper we present an approach to automatically extract and align multi-word terms from an English-Slovene comparable health corpus. First, the terms are extracted from the corpus for each language separately using a list of user-adjustable morphosyntactic patterns and a term weighting measure. Then, the extracted terms are aligned in a bag-of-equivalents fashion with a seed bilingual lex...